Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Archiver for EIA RECS #534

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

nilaykumar
Copy link
Contributor

@nilaykumar nilaykumar commented Jan 22, 2025

Overview

Closes #519.

What problem does this address?
Implements archiver for EIA RECS.

What did you change in this PR?
Added archival of some of the 2020 .xlsx files.

Testing

How did you make sure this worked? How can a reviewer verify this?

pudl_archiver --datasets eiarecs \
    --initialize \
    --summary-file eiarecs-summary.json \
    --depositor fsspec \
    --deposition-path /tmp/eia-recs/

To-do list

Tasks

Preview Give feedback

# housing characteristics
{
"base_url": "https://www.eia.gov/consumption/residential/data",
"php_extension": "index.php?view=characteristics",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RECS webpage is a little bit tricky. It has tabs containing different sequences of datafiles (and PDFs, etc.) that can't be reached directly from the base_url. Instead we need to tack on these strings to the end of the url, which are different for different tabs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth defining the shape of this dictionary as a dataclass, but definitely not a blocking concern.

"prefix": "state-ce",
"pattern": re.compile(r"ce(\d)\.(\d{1,2})\.(.*)\.xlsx"),
},
# microdata
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm skipping the microdata for now. Happy to come back and rewrite everything to be more general once I get a better sense from folks who have a sense of how deep we want to go down this rabbit hole.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good - we can revisit later. I'm unaware of the context of this rabbit hole - assuming you all had a discussion during the hackathon? @cmgosnell

zip_path = self.download_directory / f"eia-recs-{year}.zip"
data_paths_in_archive = set()
# Loop through different categories of data (all .xlsx)
for pattern_dict in LINK_PATTERNS:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call get_hyperlinks multiple times: once for each tab of data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a fun webpage!

Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks pretty good - provided some feedback on the output file name munging but nothing's blocking.

Your TODO list looks pretty good - though I'll make a draft in Zenodo just so we can have something at all archived ASAP.

src/pudl_archiver/archivers/eia/eiarecs.py Outdated Show resolved Hide resolved
zip_path = self.download_directory / f"eia-recs-{year}.zip"
data_paths_in_archive = set()
# Loop through different categories of data (all .xlsx)
for pattern_dict in LINK_PATTERNS:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a fun webpage!

"prefix": "state-ce",
"pattern": re.compile(r"ce(\d)\.(\d{1,2})\.(.*)\.xlsx"),
},
# microdata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good - we can revisit later. I'm unaware of the context of this rabbit hole - assuming you all had a discussion during the hackathon? @cmgosnell

# housing characteristics
{
"base_url": "https://www.eia.gov/consumption/residential/data",
"php_extension": "index.php?view=characteristics",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth defining the shape of this dictionary as a dataclass, but definitely not a blocking concern.

2015 methodology required some changes to allow for downloading html files.
@jdangerx jdangerx requested a review from cmgosnell January 29, 2025 23:44
@jdangerx
Copy link
Member

jdangerx commented Jan 29, 2025

Discussed on Slack with Nilay and I'm taking this over the finish line.

Draft record (currently including 2020, 2015, 2009) on Zenodo

Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the draft archive looks pretty good! i have a few small specific suggestions but in general when possible to make the link patterns more generic i'd suggest doing that so there aren't so so many to add for each year and view and sub-view file group.

its also kind of a bummer to not have get_resources be able to check if there is a new year of data. i had to poke around a lot to find somewhere for CBECS. I wonder if you could just use "https://www.eia.gov/consumption/residential/data/previous.php" and search for something like ^/consumption/residential/data/(\d{4})/$

2020: {
"housing_characteristics": LinkSet(
url=_url_for(year=2020, view="characteristics"),
short_name="hc",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest using the view name (or some other something longer) instead of the short name in the file names so the files are more decipherable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea if it were me i'd condense these two LinkSet attributes as just view and use the view name in the filename and construct the page url with _url_for in the get_year_resources

Comment on lines +67 to +78
"microdata": LinkSet(
url=_url_for(year=2020, view="microdata"),
short_name="microdata",
pattern=re.compile(r"(recs.*public.*)\.csv"),
extension="csv",
),
"microdata-codebook": LinkSet(
url=_url_for(year=2020, view="microdata"),
short_name="microdata",
pattern=re.compile(r"(RECS 2020 Codebook.*v.)\.xlsx"),
extension="xlsx",
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocking but i would suggest making these a little more generic so there aren't so many to maintain

they both have the year and they both have recs if your add re.IGNORECASE and you could (\.csv|\.xlsx) for different file types. this is kinda a high level overall suggestion to condense patterns when possible but not a hill i will die on.


async def get_resources(self) -> ArchiveAwaitable:
"""Download EIA-RECS resources."""
for year in [2020, 2015, 2009]:
Copy link
Member

@cmgosnell cmgosnell Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for year in [2020, 2015, 2009]:
for year in YEAR_LINK_SETS:

or see my suggestion in the main coment

matched_filename = (
match.group(1)
.replace(".", "-")
.replace(" ", "_")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've typically attempted to always use dashes instead of _ for file names

Suggested change
.replace(" ", "_")
.replace(" ", "-")
.replace("_", "-")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

Write an archiver for EIA Residential Energy Consumption Survey
4 participants